-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column #44598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
LuciferYang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use ConstantColumnVector for the missingCol? This maybe another story.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems simpler to use ConstantColumnVector here. I've updated the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[info] - SPARK-28156: self-join should not miss cached view *** FAILED *** (216 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1888.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1888.0 (TID 2251) (localhost executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.spark.internal.config.ConfigReader.get(String)" because "reader" is null
[info] at org.apache.spark.internal.config.ConfigEntry.readString(ConfigEntry.scala:94)
[info] at org.apache.spark.internal.config.FallbackConfigEntry.readFrom(ConfigEntry.scala:270)
[info] at org.apache.spark.sql.internal.SQLConf.getConf(SQLConf.scala:5573)
[info] at org.apache.spark.sql.internal.SQLConf.offHeapColumnVectorEnabled(SQLConf.scala:5103)
[info] at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:200)
[info] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
some tests failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, this part fails.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @cloud-fan . Could you fix the test failures?
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. This looks much nicer. Thanks!
LuciferYang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
Seems if using |
…olumn vectors for the missing column
|
I changed it back. It seems non-trivial to make |
| assert(supportBatch(sparkSession, resultSchema)) | ||
| } | ||
|
|
||
| val memoryMode = if (sqlConf.offHeapColumnVectorEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it outside of the lambda, so that we don't hit NPE by referencing sqlConf.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you re-trigger the failed pipelines?
…ode when creating column vectors for the missing column This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it. To not violate the memory mode requirement No new test no Closes #44598 from cloud-fan/orc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 0c1c5e9) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ode when creating column vectors for the missing column This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it. To not violate the memory mode requirement No new test no Closes #44598 from cloud-fan/orc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 0c1c5e9) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
|
Thank you, @cloud-fan and all. |
…ode when creating column vectors for the missing column This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it. To not violate the memory mode requirement No new test no Closes apache#44598 from cloud-fan/orc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 0c1c5e9) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 53683a8) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This PR fixes a long-standing bug that
OrcColumnarBatchReaderdoes not respect the memory mode when creating column vectors for missing columbs. This PR fixes it.Why are the changes needed?
To not violate the memory mode requirement
Does this PR introduce any user-facing change?
No
How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
no